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Abstract 



We investigate two options for performing Bayesian inference on spatial log-Gaussian 
Cox processes assuming a spatially continuous latent field: Markov chain Monte Carlo 
(MCMC) and the integrated nested Laplace approximation (INLA). We first describe 
the device of approximating a spatially continuous Gaussian field by a Gaussian Markov 
random field on a discrete lattice, and present a simulation study showing that, with 
careful choice of parameter values, small neighbourhood sizes can give excellent approx- 
imations. We then introduce the spatial log-Gaussian Cox process and describe MCMC 
and INLA methods for spatial prediction within this model class. We report the results 
of a simulation study in which we compare MALA and the technique of approximating 
the continuous latent field by a discrete one, followed by approximate Bayesian inference 
via INLA over a selection of 18 simulated scenarios. The results question the notion 
that the latter technique is both significantly faster and more robust than MCMC in this 
setting; 100,000 iterations of the MALA algorithm running in 20 minutes on a desktop 
PC delivered greater predictive accuracy than the default INLA strategy, which ran in 4 
minutes and gave comparative performance to the full Laplace approximation which ran 
in 39 minutes. 



1 Introduction 



The primary aim of this article is to provide an objective comparison of two methods for 
Bayesian Inference in the spatial log-Gaussian Cox process: a relatively slow but asymp- 
totically exact method, Markov chain Monte Carlo (MCMC); and a faster but approximate 
method, the integrated nested Laplace approximation (INLA). The secondary aim is to pro- 
vide a tutorial in some of the technical aspects involved with computation and inference for 
this class of models. 

The log- Gaussian Cox process is but one of a number of possible model classes that we 
could have used as the basis for a comparative evaluation of MCMC and INLA methods. 
Our specific motivation for focusing on this model is its use in spatial epidemiology, and 
specifically in health surveillance applications, where interest is in the predictive probability 
that the relative risk of disease at a certain spatial location exceeds a threshold set by public 



health experts. For an example in the spatio-temporal setting see Diggle et al. (2005). 
There, the data consisted of locations of incident cases each day, i.e. a spatio-temporal point 
process, and the Cox process was used to represent spatio-temporal variation in risk as a 
product of deterministic and stochastic terms representing, respectively, known risk-factors 
and unexplained variation, prediction of which was the primary goal. A high predictive 
probability that risk in a particular area exceeds the pre-declared threshold may activate 
a costly public health intervention, hence it is important that such predictive probability 
statements are as accurate as possible. For this reason we will compare MCMC and INLA 
using a metric that directly measures their ability to make accurate predictive probability 
statements. 

MCMC methods have made an enormous impact on statistical practice by making Bayesian 
inference tractable for complex statistical models, including models whose specification in- 
cludes a latent Gaussian process. However, they can be computationally burdensome and, 
more importantly, their inferential validity rests on the convergence of a Markov chain to its 
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equilibrium distribution, which can be difficult to verify empirically. INLA (Rue et al. 2009) 
is a recently developed competitor to MCMC methods. By using a combination of analytical 
approximation and numerical integration rather than Monte Carlo simulation, INLA circum- 
vents the convergence issues that arise with MCMC methods, and typically leads to quicker 
computation. However, the price paid is that the analytic approximations potentially intro- 
duce errors in the calculation of posterior probabilities. The goal of the simulation study in 
this paper is to assess the trade-off between faster computation and errors of approximation. 

Our focus is on predicting (functions of) a latent spatially continuous Gaussian process y, 
which we approximate by a Gaussian field, Y, on a finely spaced, regular square grid of 



points on the plane. This is in contrast to the methods discussed in Lindgren et al. (2011 ), in 
which a representation is constructed on a triangulation of a set of irregularly spaced points. 
Lindgren et al. (2011 ) assume that y has Matern second order structure, that is for u,v £ M^, 



cov{u,v) = Y^j^^^jziii^Wv - uWYKjyiKl 



V — u 



where VjK > and Ki, is a modified Bessel function of the second kind. The major advantage 
of their approach is its low computational cost: for any choice of the admissible parameters 
of the covariance function, they are able to compute the precision matrix of the GMRF 
approximation in 0{n) time, where n is the number of triangulation points. We prefer to 
retain greater flexibility in choosing an appropriate model for the covariance, allowing data 
and scientific knowledge to inform this choice. For this reason, we focus instead on the method 



described in Chapter 5 of Rue and Held ( 2005 ) , which allows effectively any covariance model 
to be fitted to the data. 

In Section [2| we discuss the approximation of a spatially continuous Gaussian process by 
a Gaussian Markov random field, and give the results of a simulation study detailing the 
effectiveness of this procedure. In Section[3| we describe the spatial log-Gaussian Cox process. 
Sections |4] and [5] give details of the MCMC and INLA methods, respectively. Section [6] 
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summarises the findings. Section [7] is a concluding discussion. Throughout the article, we 
use vr to denote a generic probability density function. 



All of the methods discussed in the article are implemented in the R package Igcp; see Taylor 



et al. (2011). 



2 Spatially Continuous Gaussian processes and their Approx- 
imation by Gaussian Markov Random Fields 

In this section, we consider how to approximate a spatially continuous Gaussian process by 
a GMRF and, via simulation, how good such an approximation is. Note that a discussion of 



this topic is given in Chapter 5 of Rue and Held (2005). To begin, we introduce some more 
general concepts. 

2.1 Theory 

A spatially continuous Gaussian process, y, is a real-valued continuous Gaussian process 
on the plane, M?. This means that 3^ is a continuous function from — )• M with the 
property that for any finite collection of locations, {si}f^i, the joint distribution of the 
random variables representing the value of the process at each of the locations, {y{si)}, is 
multivariate Gaussian, y is called strictly stationary if K{y{si)) = a for some a € M and 
any spatial location Sj and strictly second-order stationary and isotropic, if the covariance 
between y{si) and y{sj) only depends on the Euclidean distance between Si and Sj, denoted 



by 1 1 Si — Sj 1 1 (Gelfand et al. , 2010). The covariance between tt, u E M will be assumed to 
have the form, 

cov(n,f) = (T^r(||f — u\\/(l)), 
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where r is a standard isotropic correlation function: for example a Matern function. The 
parameter a dictates the point-wise variability of the field, whilst the scale parameter (p 
governs the rate of decay of the correlation in space. In what follows, an italic Roman, Y, 
will be used to denote the values of 3^ at a finite set of locations in space. We say that Y 
represents the process y. 

A Gaussian Markov random field is a collection of random variables, Y = {Yi, . . . , Y^}, that 
have a multivariate Gaussian distribution, Y ~ MVN(/1, Q~^), where for any j, 

[Yj\Y-j] = [1^1 the neighbours of f^]. 



where Y-j denotes {Yi, . . . , Yj-^i, Yj+i, . . . , Y^} and [ • ] means 'the distribution of. We use Y 
Y to distinguish between respectively the Gaussian field and the Gaussian Markov random 
field representation of a process y at the same finite set of locations. The neighbours of j, JVj, 
are usually a much smaller subset of Y; all other elements of Y are conditionally independent 
of Yj, given Mj. The pattern of conditional independence is evident in the precision matrix, 
Q: Theorem 2.2 in 



Rue and Held 



(2005) states that. 



Y ALYj\AfiUj^j 



Qij = 0. 



In the case that the neighbourhoods of each element are very small subsets of Y, the matrix 
Q is sparse. This allows otherwise computationally prohibitive operations, such as matrix 
inversion, to be implemented with fast algorithms. 

To simplify matters, consider a square observation window, W, on which a spatially continu- 
ous Gaussian process is represented by a finite collection of random variables Y = {l^j}^^;^ 
spaced on a regular square grid , G = {Gij}^'^^^, where the Gij are the centroids of grid 
cells, that cover W. For computational reasons to be explained, we assume that M = 2™ 
for some positive integer m. In order to obtain a representation of the stationary second 
order structure of the process, it is necessary to extend this grid, typically to a grid of size 
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2M X 2M, which is wrapped on a torus. This action gives rise to the notion of a toroidal 
distance metric, by which is meant the minimum distance between two points, travelling 
either directly (e.g. if the points were very close together on the torus) or around the minor 



and/or major radii. A precise definition is given in M0ller et al. (1998). 



Let y°^* = {y^f be random variables at grid locations G^^^* = {Gff j^jl^ on the 

extended space. Note that in cases where the spatial decay parameter, 0, is quite quite large 
compared with the size of W , then in order to obtain a valid covariance structure, the grid 



may have to be extended further, e.g. onto a 4M x 4M toroidal grid, see for example M0ller 
et al. (1998). The covariance matrix, Scxti of the discrete field y^^* on the extended grid is 



typically massive, dense and with a dense inverse, Qext- As an example, for a 128 x 128 grid 
in the extended space, the covariance matrix has dimension 16384 x 16384: the storage and 
manipulation of such matrices under ordinary circumstances is not computationally feasible 
on a desktop PC. However, in the extended space, Sgxt is block-circulant (see below) and 
symmetric positive definite (SPD) with a block circulant SPD inverse, Qoxt- The symmetry 
induced in the covariance matrix by extending the grid and wrapping it on a torus means 
that each entry of Sgxt is one of exactly (2M)^ = 128^ = 16, 834 elements, instead of a 
possible 16, 384^^ = 268, 435, 456 different elements. Furthermore, matrix computation in the 
extended space is greatly aided by using the discrete Fourier transform (DFT), which is why 
the focus of this article is on grids of dimension 2™ x 2*". In fact, it is possible to drop this 
assumption, at the price of reduced speed in the DFT. Algorithms are available to construct 
optimised computational plans for implementing the DFT on other grid sizes, for example 



the FFTW library (Frigo and Johnson, 2011) and an R wrapper library (Krey et al. , 2011). 
In what follows, DFT and IDFT will denote, respectively, the discrete Fourier and discrete 
inverse-Fourier Transforms; as an abuse of notation, the same abbreviations will be used for 



the 1- and 2-dimensional transforms, with the choice being context dependent (Wood and 



Chan, 1994). 



A full discussion of why and how the DFT is used in matrix computations on block circulant 



6 



matrices is given in Chapter 2 of Rue and Held (2005), but for completeness a very brief 
summary follows. An n x n matrix A is said to be circulant if it has the following structure: 



A 



«0 0,1 • • • ttn-l 
fln-l OO • • • «n-2 



ai 02 



where a, belong to a field, for example the real numbers. The ordered set of elements, 
a = {aj}'j^i, is called the base of A. A real nm x nm block circulant matrix C is one with 
the following structure: 

Ao Ai ■■■ Am-l 

Arn-l Aq • • • 

Ai A2 ■■■ Ao 

where each Ai is a circulant matrix with base = {aij}^!^. The matrix 



aoi 



an 



OO(n-l) CH{n-l) 

is known as the base matrix of C. 



O{m-l)0 
«(m-l)l 

a(m-l)(n-l) 



The eigenvectors of a circulant matrix A (as defined above) can be written as 



n-l 



Ej = Oj exp{— 27riij/n}. 



i=0 



The complete set of eigenvectors are stored in columns of the (unitary) discrete Fourier 
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transform matrix, 



F 



1 1 
1 

1 UJ 



UJ 
2 



UJ 



UJ 



UJ 



1 

2(n-l) 



1 w"-! a;2(n-i) 



(n-l)(n-l) 



where w = exp{27ri/n}. Now, let be a diagonal matrix with the eigenvalues {Ej}^~Q on 
the leading diagonal. By expanding the matrix product analytically it is straightforward to 
verify that the matrix A has spectral decomposition, A = FEF^ , where the superscript H 
denotes the conjugate transpose. The most useful aspect of the matrix F is that matrix- 
vector products Fv and F^v are available directly as the DFT and IDFT, respectively, of 
the vector v. The vector E is available as ii^ = y/nFd and matrix square roots are computed 
from expressions such as A^^"^ = FE^^'^F^ . These results massively simplify computation 
with ordinary circulant matrices; furthermore, the theory extends to block circulant matrices, 
where the 2-dimensional DFT and IDFT are used. 

In what is to follow the 'full covariance matrix' will mean the covariance matrix of y^*^ i.e. 
Scxt- For the application in mind, namely the spatial log-Gaussian Cox process, it has been 



argued that Markov chain Monte Carlo using the full covariance matrix is inefficient (Rue 



et al. , 2009; Simpson et al. , 2011). The suggested alternative is to use the integrated nested 



Laplace approximation to perform approximate Bayesian inference with a sparse GMRF 
approximation to the full covariance matrix so as to reduce computational cost. We now 
discuss the construction of such a GMRF approximation. 

For the full covariance matrix defined by an arbitrary choice of correlation function, the 
dependence structure is 'dense': there is no sparse conditional independence structure - the 
precision matrix Qext is a dense matrix. A GMRF approximation to Scxt is constructed 
by parametrising the precision matrix, Qcxt = Qcxt(^)i and choosing 6'opt to be such that 



ext V^'opt 



) ^ = Sext(^opt) is 'as close as possible' to Sext- The parametrisation considered here 
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Figure 1: Illustrating the choice of 1- and 2- neighbourhoods, respectively the left and right 
hand plots. The black square is the grid cell of interest and the grey cells are those specified 
to be the neighbours in the chosen level of dependence. 



is similar to that presented in Chapter 5 of Rue and Held ( 2005 ) , in which the neighbourhood 



of a grid cell consists of all the cells in the box surrounding the cell of interest, up to a 
certain distance away. Figure [T] illustrates what will be referred to in this article as 1- and 
2-neighbourhoods: the box obtained by specifying respectively dependence on all cells in the 
box a distance of up to 1 and up to 2 cells both vertically and horizontally around the cell 
of interest. For the 1-neighbourhoods, 9 has 3 elements, corresponding to the dependences 
between cells a distance apart, between directly adjacent cells and between diagonally 
adjacent cells. In a similar way, for the 2-neighbourhood dependence structure, 9 has six 
elements. 

Let ? be the base matrix of Sext and let q[9) be the base of the inverse of Qext, with base 
matrix ^{0). Note that, q{9) can be computed from ip{9) using the 2-dimensional discrete 
Fourier transform, 

m = , — ^-y— IDFT(DFT(V;(e)) ® (-1)), 

length{vec[^/'(t^)J j 

where X® (—1) denotes the raising of each element of matrix X to the power —1 and vec(X) 
is the vector obtained by stacking the columns of X on top of each other. The optimal 9 is 
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found as ming {U{6)} where, 



2 



where a subscript ij denotes the (z, j)-element of the matrix, 

1 ifii = (0,0) 

Here, d{i,j) is the distance from cell to the reference origin (0,0) and a is a constant. 



set equal to 1 in the experiments below, see Rue and Held (2005) for a justification of this 
choice. 

Differentiating U with respect to the kth component of the parameter, gives 



ddk 



which can also be computed using the DFT, since 



2x:^'«fe-«(«).j)^, 



where denotes element-wise multiplication, and the matrix gf^V'(^) is a matrix of I's where 
9k appears and O's otherwise. 

With the above ingredients, standard software such as the optim function in R can be used to 
compute optimal parameters. In particular, the availability of the gradient function enables 
the user to take advantage of gradient-based optimisation methods such as the BFGS method 
implemented in optim. A sensible starting point for the optimiser is given by the base matrix 
of the diagonal precision matrix, diag(l/cj^). 

The reader who is daunted by the prospect of implementing the above functions should note 
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that this has already been done in the Igcp R package (Taylor et al. 



2011). 



2.2 Simulation Study 

We performed a simulation study to investigate the ability of the algorithm detailed above 
to approximate a spatially continuous Gaussian process, and hence choose an appropriate 
neighbourhood size for use later in the article (see Section |6]). 

Given a set of parameters of the latent field, an observation window and a grid size, it is 
possible to compute the base matrix of Scxt- With the same inputs and an optimisation 
step, it is also possible to compute the base matrix of the sparse representation, I]ext(6'opt)- 
A measure of the performance of the approximation is given by the mean square error in 
simulating Gaussian random variables. Given a vector of standard Gaussian variates, F, a 
realisation of a Gaussian field with mean fj, and the correct second order properties is given 

1/2 ~ 1/2 

by y = /i + Sp^^.r, whilst an approximation of the field is y = fi + T,^!^^T. To compare how 
well the field has been approximated, an appropriate measure is the integrated mean square 
error. We estimate this from a repeated sequence of n independent realisations of T as 

n M M 

i=i j=i k=i 

where yijk is the value of the {j, k) cell of y for the ith realisation of T and M is the number of 
grid cells in each direction (here, the grid is assumed to be square). Figure[2]is a visualisation 
of a true field and two possible sparse approximations using 1- and 2-neighbourhoods. 

For the simulation study, the value of a was fixed at 1 and (p allowed to vary between 0.025 
and 0.2 with the unit square as the observation window. The parameter a was fixed because 
the computed mean square errors would simply scale linearly with u^. The range of values 
of (j) was chosen to include a selection of scenarios that might be encountered in practice; the 
important factor is the size of (j) with respect to the observation window and the grid. When 
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Figure 2: Simulated Gaussian fields. Middle plot: the true field with full covariance struc- 
ture; left-hand plot: approximate field with a 1-neighbourhood; right-hand plot: approximate 
field with a 2-neighbourhood. The simulation took place on a 512 x 512 grid, the optimisation 
took respectively 64 and 355 seconds to compute the parameters of the 1- and 2- neighbour- 
hoods. The result from the 2-neighbourhood is virtually indistinguishable to the eye from 
the true field, whereas the 1-neighbourhood has a grainy appearance. The respective MSB's 
were 0.38 and 0.007. 

(j) is small compared to the size of the grid, the cells become approximately independent. It 
should be harder to obtain a good approximation to the latent field for larger values of as 
in this case spatial dependence can be significant for cells a moderate distance apart on the 
grid. To reduce the possibility of results depending on an artefact of the choice of grid size, 
two different resolutions were used: 128 x 128 and 256 x 256 in the extended space. For the 
comparison of x and x, the parameter was set to zero. 

Table [T] presents the results of the simulation study. There was a significant improvement in 
the approximation using a neighbourhood size of 2 compared with a neighbourhood size of 1. 
The 3-neighbourhood approximations did not appear to improve the quality of the estimated 
latent field, but were more robust: for the 2-neighbourhood results, the BFGS option within 
optim appeared to converge to a local sub-optimal value on two of the occasions, but improved 
results were obtained by the Nelder-Mead simplex option in these cases. 

The main conclusions from this study are as follows: (1) it is possible to compute a very good 
sparse approximation to a given Gaussian field for sensible values of (j) compared with the size 
of the observation window; (2) the optimisation time is quick: around a minute for 256 x 256 
grids; (3) the optimisation step is not always robust, so in practice, building in a simulation 
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Ext. Grid 


<t> 


MSE 1 


MSE 2 


MSE 3 


Bias 1 


Bias 2 


Bias 3 


128 


0.025 


0.086 (0.6) 


0.014 (6.2) 


0.023 (9.7) 


1.179 X 10"* 


3.86 X 10"'' 


1.424 X 10"^ 


128 


0.05 


0.174 (1.1) 


0.025 (6.4) 


0.062 (9.8) 


5 X lO""* 


1.333 X 10"* 


0.618 X 10~* 


128 


0.1 


0.283 (1.1) 


0.016 (6.5) 


0.042 (9.8) 


0.002 


4.332 X 10"* 


1.891 X 10"* 


128 


0.15 


0.358 (1) 


0.018 (6.7) 


0.03 (9.8) 


0.003 


0.001 


4.338 X 10"* 


128 


0.2 


0.419 (1.5) 


0.031 (6.8) 


0.064 (9.9) 


0.004 


0.001 


0.001 


256 


0.025 


0.173 (5.2) 


0.024 (33.6) 


0.005 (58.5) 


-1.92 X 10""* 


-0.535 X 10"* 


-1.005 X 10"^ 


256 


0.05 


0.278 (5.6) 


0.024 (35.4) 


0.034 (59.7) 


-0.001 


-1.848 X 10"* 


-0.703 X 10"* 


256 


0.1 


0.389 (7.7) 


0.041 (35.4) 


0.069 (61.5) 


-0.002 


-0.001 


-3.078 X 10"* 


256 


0.15 


0.462 (6.6) 


1.677* (32.7) 


0.051 (61) 


-0.004 


-0.003 


-0.001 


256 


0.2 


0.515 (8.5) 


1.826** (27.1) 


0.051 (58.6) 


-0.004 


-0.003 


-0.001 


Table 1: 


Table illustrating 


the ability of 


a GMRF to approximate a 


GF on a unit 


square 



observation window. 'Ext. Grid' is the size of the DFT grid used, giving respective output 
grid sizes of 64 x 64 and 128 x 128; 'MSE 1-3' denotes the mean square error for respective 
neighbourhood sizes 1-3, with computation time in seconds in parenthesis; and 'Bias 1- 
3' is the mean bias. Optimisation was performed using the 'BEGS' method in R's optim 
command. Using the Nelder-Mead simplex method, the two exceptional values, marked * 
and ** had improved MSE's of 0.284 and 0.378. On these occasions, it appeared that the 
BEGS optimiser had converged to a sub-optimal set of parameters. 

step such as described above, i.e. computing x and x for a range of E, is worthwhile in order 
to evaluate the approximation; finally (4) for the purposes of the simulation study in Section 
[6j the 2-neighbourhoods should be sufficiently accurate, because the values of </> considered 
in Section [6] are sufficiently small compared with the observation window. 



3 The Spatial Log-Gaussian Cox Process 



Let C be an observation window in space. Events occur at spatial positions s ^ W 
according to an inhomogeneous spatial Cox process with intensity function R{s). Conditional 



on i?, the number of events, Xs, occurring in any C 1^ is Poisson distributed (M0ller et al 



1998), 



Poisson <; / R{s)dt 



(2) 



Following Diggle et al. (2005), we decompose the intensity multiplicatively as. 



R{s) = ^x\{s)e^v{y{s)]. 



(3) 
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In Equation [3| the fixed spatial component, A : M i— )• M>o, is a known function, proportional 
to the average intensity of the process at each point in space and scaled so that 



/ X{s)ds 
Jw 



(4) 



whilst /i is the expected number of events in W. The function y captures the residual 
variation and is a continuous spatial Gaussian process. The components A and fx define the 
expected behaviour of the point process, whereas y determines the residual variation. When 
A and /u are known, it is this residual variation that is of statistical interest: inference about 
y gives information on extraneous spatial clustering of events, with many applications in 



ecology and epidemiology (M0ller et al. , 1998 Diggle et al. , 2005, Rue et al. 2009; Simpson 



et al., 2011). 



In this article, y, is second order stationary with minimally-parametrised covariance function. 



coy[yisi),y{s2)] = a\i\\s2 - siH/^/-)}, 



(5) 



where a and (j) are known parameters. The parameter a scales the log-intensity, whilst the 
parameter (p governs the rate at which the correlation function decreases in space. The mean 
of the process y is set equal to —a'^/2 so as to give E[exp{3^}] = 1. 



Following M0ller et al. (1998), Brix and Diggle (2001) and Diggle et al. (2005) we will use 



a discretised version, Y, of the above model, defined on a regular square grid. Strictly, 
observations X arising from the model are then cell counts on the grid, the intensity of the 
process being treated as constant within each cell. In practice, the aim is to make the lattice 
spacing sufficiently fine that cell-counts greater than 1 are rare and the error introduced by 
piece-wise constant approximation to the intensity is negligible. 

Recall that the discretised y is a finite collection of random variables, so the properties of 
y imply that Y has a multivariate Gaussian density with approximate covariance matrix S. 
The elements of S are calculated by evaluating Equation [5] at the centroids of the spatial grid 
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cells. 



The next two sections present alternative methods for predictive inference of Y conditional 
on the observed data X. 



4 Markov Chain Monte Carlo 

The main advantage of the MCMC approach to inference for this model is that the 'full co- 
variance matrix' of Y is used (compare with Section [s]). Predictive inference about Y requires 
samples from the conditional distribution of the latent field, Y, given the observations, X, 
namely 

tt{Y\X) (xtt{X\Y)tt{Y), (6) 

In order to evaluate 7t{Y) in Equation [6| the parameters of the process Y must either be 
known or estimated from the data. Estimation of a and (j) can be achieved either in a 
Bayesian or likelihood-based framework, or by one of a number of more ad hoc methods e.g. 



Brix and Diggle (2001) and Diggle et al. (2005). These ad- hoc methods are described and 



implemented in the R package Igcp, see Taylor et al. (2011). 



Markov chain Monte Carlo methods work by drawing a sequence of dependent samples from 
a target density of interest in situations where it is not possible to draw independent samples 
directly (e.g. via inversion of the cumulative distribution function). In the limit as the number 
of draws tends to infinity, the resulting output behaves as if it were a sample from the density 
of interest. In practice, if the chain appears to have converged to its stationary distribution 
and is mixing well, that is to say that the autocorrelation of the generated sample is low, then 
reliable inference can be can be made with relatively short runs. However, for non-trivial 
applications, the chain often mixes poorly, and it is difficult to tell whether stationarity has 
been achieved without prohibitively long runs to double-check that this is the case. Reviews 



of MCMC methodology include Gilks et al. (1995) and Gamerman and Lopes (2006). 
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The most well known MCMC method, the Metropolis Hastings (MH) algorithm (Metropolis 



et al. , 1953; Hastings, 1970) is "arguably the most successful and influential Monte Carlo 



[technique]" (Girolami and Calderhead, 2011); along with Gibbs sampling, it is probably 
the most frequently employed method in the literature. Let tt be the target density. Given 
a current state in the chain, Y, a new state, Y' , is proposed from any probability density, 
q{Y,Y') = F{Y'\Y), and accepted with probability. 



a{Y,Y') = minjl, ^ 



7T {Y'\X)qiY',Y) 
(Y\X) q{Y,Y') 



(7) 



A crucial step in designing an effective sampling regime is the choice of proposal kernel q. 
The word 'choice' both describes the specific probability density function associated with q, 
as well as any tuning parameters, h, that are associated with it. A very simple example 
is a random- walk kernel, where q{Y,Y') ~ N(y, /i^). For inference with the log-Gaussian 
Cox process, this article will focus on a more sophisticated, but well-studied version of the 
Metropolis-Hastings algorithm, the Metropolis- Adjusted Langevin algorithm. 



Following M0ller et al. (1998), Monte Carlo simulation from 7r(y|Ar) is made more efficient 
by working with a linear transformation of Y^^^. Writing F = T,^-^/'^ (Y'^^^ — cr^/2), the target 
of interest is given by, 

^(F|X)a7r(X|y-*)7r(F), (8) 

When the gradient of the transition density can also be written down explicitly, a natural 
and efficient MCMC method for sampling from the predictive density of interest (Equation 
[s]), is a Metropolis-Hastings algorithm with a Langevin-type proposal ( Roberts and Tweedie[ 



1996, M0ller et al. 1998) 



9(r,r') 



N 



r';T + -Vlog{Trir\X)}, hH 



where N(y; m, v) denotes a Gaussian density with mean m and variance v evaluated at y, I 



is the identity matrix and /i > is a scaling parameter (Metropolis et al. , 1953 Hastings 
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19701). 



Various theoretical results exist concerning the optimal acceptance probability of the MALA 



(Metropolis- Adjusted Langevin Algorithm); see Roberts and Rosenthal (1998) and Roberts 



and Rosenthal (2001). In practical applications, the target acceptance probability is often 



set to 0.574, which would be approximately optimal for a Gaussian target as the dimension 
of the problem tends to infinity. An algorithm for the automatic choice of h so that this 
acceptance probability is achieved without disturbing the ergodic property of the chain is 



detailed in Andrieu and Thoms (2008), and is implemented in the Igcp R package (Taylor 



et al., 2011). 



In our case, the log of the target density (Isj) is given by 



lo 



g{7r(7|X)} = constant -\m\^ + {y{s)X{s) - fiCAX{s) exp(y(s))} , 



where S is the set of grid cells in the observation window, Ca is the area of the individual 
grid cells and X{s) is the number of events in cell s. The gradient can be computed using 
the DFT, 



Vlog{7r(7|X)}(5) = -7(5)+ 



1 



length{vec[y'=^t]} 



DFT{e1/2 q iY)YT[x - ^CAAexp(y"^*)]}) (s). 



where in this case, s is a cell in the extended grid. 



5 The Integrated Nested Laplace Approximation 



Further details on the material in this section are given in |Rue et al.| ( [2009l >. The G aussian 



Markov random field/integrated nested Laplace approximation (GMRF/INLA) approach 
to inference for spatial log-Gaussian Cox processes relies on two different approximations. 
Firstly, an approximation to the full covariance matrix of Y is obtained, yielding substantial 
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computational benefits (see Section [2]). Then, this approximate covariance takes the place 
of the full covariance in the model, and Bayesian inference is performed using the integrated 
nested Laplace approximation. 

The integrated nested Laplace approximation delivers approximate inference for the posterior 
marginals of the latent field given the observed data and also for parameters of the latent 
field. Let •& be hyperparameters of the latent field y, noting that these are different to the 
parameters </>) mentioned above. INLA is based on standard results for marginal and 
conditional densities: 



■n{Y{s)\X) = j 7r(y(s)|T9,X)7r(i?|X)dT9, 
7r(i9fc|X) = / 7r(t?|X)di?_fc, 



Where, conditional on the observed data, 7r(y(s)|X) is the posterior marginal density of the 
latent field at cell s and T:{'dk\X) is the marginal posterior density of the A;th component of 
^?. With INLA, these exact relations are replaced by the approximations. 



%{Y{s)\X) = J n{Y{s)\^,X)iT{^\X)d^, 

TT{^k\X) = [ 7f(l?|X)dl9_fc, 



where vf denotes an approximation to a probability density function (pdf ) . The approximation 
of the joint density of the hyperparameters "!? uses 



TTimX] (X 



7Tg{Y\^,X) 



Y=Y*(^) 



where ttg denotes a Gaussian approximation to a density and Y*'{'d) is the mode of the full 
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conditional for Y. Lastly, the remaining undefined pdf is also approximated by a Gaussian, 



To evaluate the marginal posteriors. Rue et al. (2009) suggest a clever quadrature scheme 



over iD. Optimised numerical routines for performing the above computations, and a large 
suite of other methods not covered in this article, are implemented in the R INLA package 



available from |www . r- inla . org (Rue et al. 2009). 



For the simulation study below, Bayesian inference was performed using INLA's genericO 
model. The default method for fitting this model includes a specification for the precision 
matrix, rQext, where r is a hyperparameter with a log-gamma prior. For the application 
considered here, the parameter r is redundant because, having parametrised and estimated 
the precision matrix (as detailed in Section [2]), Qext is a known quantity. The additional 
uncertainty induced by the default choice of prior was removed by fixing r = 1. Fixing r also 
considerably speeds up the time taken to fit the model. Since for the simulations, the mean 
/i is known, the model was fitted without an offset (for those familiar with the INLA package, 
form <- X~ -1 + f (. . .)). 



6 Simulation Results 
6.1 The Simulated Scenarios 

We simulated a total of 18 scenarios on a unit square observation window. We generated two 
different fixed spatial components, Ai(s) and A2(s), by scattering 200 points uniformly on 
the unit square and fitting a fixed-bandwidth bivariate Gaussian smoothing kernel to these, 
with respective standard deviations 0.04 and 0.1. 

For each of the fixed spatial components, we used an exponential covariance model with all 



19 



combinations of (p =0.02, 0.04, 0.6 and a = 0.5, 1, 2. Since the latent field is exponentiated in 
the model, increasing values of a make predictive inference more challenging for both MALA 
and INLA. We also hypothesised that larger values of (p would present a greater challenge for 
the GMRF approximation step, and hence for the approximate inference step using INLA. 

Note that for the full Gaussian field comparison below, we use the word 'INLA' to mean 
'a GMRF approximation to the latent field followed by inference via the integrated nested 
Laplace approximation', this is distinct from the R library INLA. 

Simulation and prediction took place on an identical grid size of 128 x 128 in the extended 
space. This enabled a direct comparison of the true Y used to generate the data, and the 
predicted Y\X generated from the fitting processes. All computations were carried out on a 
3.2GHz Intel(R) Core(TM) i5 desktop PC with 4Gb RAM. 



6.2 Algorithm Parameters 



For inference with MCMC, the sampler was set to run for 100,000 iterations with a 10,000 
iteration burn-in and every 90th sample retained. The choice of 100,000 iterations was 
primarily for computational reasons: pilot runs had shown that results could be generated 
relatively quickly with this choice (in under 20 minutes), whilst the issue of convergence will 



be discussed in Section 6.3 The chain was initialised with F set to zero, corresponding to the 
mean of the field Y, i.e. — cj^/2. For the adaptive MCMC, we used a method introduced by 



Andrieu and Thoms (2008) that is built into the Igcp package. This uses a Robbins-Munro 



stochastic approximation to adapt the tuning parameter of the proposal kernel ( Robbins and 



Munro, 1951). At each iteration of the sampler, the tuning parameter is updated according 

/,(^+i) = /,»+,^(^+i)(aW-«„p,), 



to the iterative scheme, 
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where h{i) and a^*) are the tuning parameter and acceptance probabihty at iteration i and 
Oopt is the target acceptance probabihty. The sequence {?7^*^} is chosen so that Yl't^oV^^^ is 
infinite but for some e > 0, X]£o {v^^^)^^'' is finite. These two conditions ensure that any 
value of h can be reached, but in a way that maintains the ergodic behaviour of the chain. 
The class of sequences used in this simulation study was r/^*) = C/i" where a = 0.5 and 
C = 1; the scheme wa,s initia-lisGci with, h — 1 cuid set to target ttopt — 

0.574. 

For inference with INLA, we compared a GMRF approximation using neighbourhood sizes 
of both 1 and 2. The first two methods employ a Gaussian approximation to the target 
(in INLA this corresponds to strategy="siniplif ied.laplace"), which will henceforth be 
referred to as GAINl and GAIN2 ('GMRF Approximation followed by INLA'). The third 
method, GAIN3, employed a neighbourhood size of two and the full Laplace approximation 
to the target (strategy="laplace"). Fitting was achieved with INLA's genericO model, as 
discussed in Section [Sl 

6.3 Results 

Mean computation times over the 18 scenarios are given in Table [2| these show that GAINS 

was the slowest, running in 39 minutes; MALA was next, running in 20 minutes; followed by 

GAIN2, in around 4 minutes; and GAINl, which ran in around 1.5 minutes. These timings 

are based on the version of the INLA package as was available on 8^^ February 2012, we have 

since been informed by the authors that recent work on the laplace strategy could further 

reduce computation time by 30-50%. 

MALA GAINl GAIN2 GAIN3 

Mean Comp. Time (mins) 20.236 1.443 4.292 39.147 
Variance 5.245 0.043 2.318 13.764 

Table 2: Mean and variance of computation time for each algorithm. 

Two measures were used to compare the performance of each method: the mean square error 
in estimating the latent field and a measure based on estimating probabilities. Details of the 
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Scenario a (j) MALA Last h MSE 

MALA GAINl GAIN2 GAIN3 



1 


0.5 


0.02 


0.099 


0.211 


0.214 


0.213 


0.212 


3 


0.5 


0.04 


0.104 


0.175 


0.185 


0.185 


0.188 


5 


0.5 


0.06 


0.103 


0.179 


0.175 


0.175 


0.173 


7 


1 


0.02 


0.082 


0.698 


0.717 


0.707 


0.7 


9 


1 


0.04 


0.052 


0.548 


0.583 


0.553 


0.55 


11 


1 


0.06 


0.049 


0.443 


0.481 


0.442 


0.435 


13 


2 


0.02 


0.015 


2.314 


2.378 


2.376 


2.306 


15 


2 


0.04 


0.009 


1.918 


2.147 


1.942 


1.913 


17 


2 


0.06 


0.004 


1.737 


2.371 


4.382 


6.792 



2 


0.5 


0.02 


0.108 


0.224 


0.225 


0.225 


0.224 


4 


0.5 


0.04 


0.103 


0.178 


0.182 


0.182 


0.182 


6 


0.5 


0.06 


0.098 


0.153 


0.16 


0.16 


0.161 


8 


1 


0.02 


0.081 


0.636 


0.655 


0.64 


0.646 


10 


1 


0.04 


0.066 


0.56 


0.598 


0.563 


0.559 


12 


1 


0.06 


0.06 


0.511 


0.516 


0.509 


0.492 


14 


2 


0.02 


0.014 


2.631 


2.654 


2.711 


2.605 


16 


2 


0.04 


0.013 


1.954 


2.078 


2.053 


1.937 


18 


2 


0.06 


0.003 


1.867 


2.306 


4.091 


6.241 



Table 3: Simulation results: dataset parameters, last value of h from MALA and mean square 
errors in predicting the true field. The upper half of the table gives results for fixed spatial 
Ai(s), the lower half results for fixed spatial A2(s). 



22 



parameters for each scenario, the last value of h in the MALA run, and mean square errors 
are shown in Table [3j The mean square error was calculated as, 

where Min is the number of cells inside the observation window and the summation takes place 
over these cells. The results show that MALA gave marginally better point-wise prediction of 
the mean field. With the exception of scenarios 17 and 18, there is a tendency for INLA with 
a higher order neighbourhood structure to perform better, though the difference between the 
two implementations was only slight. These MSE's are presented to give the reader a sense of 
the increasing difficulty of the datasets. They cannot be used for comparing the algorithms be- 
cause they contain no measure of estimated uncertainty in the latent field. In order to account 
for this uncertainty, we now introduce a measure of predictive ability. For each cell s inside 
the observation window, both MALA and INLA can produce a set of estimated quantiles, in 
this case for probabilities Qk & q = {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99}, 
that is a to say a set of thresholds, Cfc(s), satisfying, 

nY{s)<Ck{s)\X]=qk. 

Each algorithm yields a different set of estimated thresholds for each cell, say c^^\s) = 
c(2)(s) = {cf^(s)} and c^^\s) = {cf^(s)} respectively for MALA, GAINl and 
GAIN2. Now, for / G {1,2,3} we define 

zl!\s) = I[Y{s)<4\s)], 



where I is the indicator function. Then, Z^'^(s) is the indicator function of whether or not 
the true field in cell s was below the inferred threshold, c'l\s), from method /. If c^p{s) is an 
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unbiased estimator of the true threshold, c^j^'^^is), then E[Z^ (s)] = q^, since, 

qk = nY{s) < cTis)] = E {f[Y{s) < c^l^{s)\X]} = E {e{I[Y{s) < c1\s)]\X}} = E[zj!\s)]. 



The two measures of predictive ability are bias and mean squared error in estimating the 
probabilities, qk, over the whole observation window. The 'predictive mean square error' was 
computed as, 

MSE.^i|:{^Ei*-4«(»)i| 

The results are given in Table |4j 



;enario 


GAINl 


GAIN2 


GAIN3 


Scenario 


GAINl 


GAIN2 


GAIN3 


1* 


2.697 


2.258 


1.594 


2* 


14.192 


10.666 


11.747 


3* 


2.801 


2.748 


3.553 


4* 


6.454 


6.499 


7.065 


5 


0.645 


0.646 


0.53 


6* 


3.149 


3.15 


3.719 


7* 


4.239 


4.688 


0.762 


8* 


4.077 


1.49 


6.719 


9* 


3.309 


1.981 


0.91 


10* 


2.449 


1.34 


0.471 


11* 


1.156 


1.045 


0.614 


12 


0.374 


0.777 


0.434 


13* 


26.929 


82.41 


0.531 


14* 


2.721 


9.693 


0.692 


15* 


84.014 


60.145 


0.558 


16 


0.457 


3.62 


1.131 


17* 


3.907 


8.796 


17.63 


18* 


36.912 


135.375 


311.099 



Table 4: Mean square error in estimating probabilities, MSE2, using each of the three INLA 
approximations, relative to MSE2 for the MALA algorithm. A * in the first column indicates 
the scenarios where MALA outperformed GAINl and GAIN2. The left table are the results 
for fixed spatial Ai(s) and the right table gives the values for fixed spatial A2(s). 

These results show that GAINl and GAIN2 outperformed MCMC in only two of the scenar- 
ios considered (5, 12 and 16). In the remaining scenarios, the relative mean squared error 
of GAIN2 varied from 1.045 to 82.41, the median being 3.38. Increasing the neighbourhood 
size from 1 to 2 did not lead to a median decrease in relative MSE, computed as the me- 
dian of MSE2(GAIN2)/MSE2 (GAINl). The median decrease in relative MSE comparing 
MALA to GAIN3 was 1 (the median of MSE2(GAIN3)/MSE2(MALA)), although GAIN3 
performed better in the scenarios with fixed spatial Ai(s) compared with those with A2(s), 
with respective values of 0.76 and 3.72. 
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The main source of these observed differences in predictive mean squared error is bias in the 
estimated probabihties. Figure [3] shows plots of estimated versus true probabilities in each of 
the 18 scenarios for each of the algorithms. The bias in these plots over the scenarios for both 
INLA algorithms is apparent in the 'S' shape around the line y = x, whereas for MCMC, 
the results are approximately symmetrically distributed about it. Additionally, these plots 
show that in two of the scenarios, both GAIN2 and GAIN3 would not give good estimates of 
probabilities. 



MALA GAIN1 




Figure 3: Illustrating the estimation of quantiles qk by those inferred by each algorithm 
qk = E[Z^\s)] in scenarios 1-18. MALA in the top left plot, GAINl in the top right plot 
and GAIN2 in the bottom left plot and GAIN3 in the bottom right hand plot. For the 
MALA plot, the estimated quantiles are distributed about the line 'y = x\ whereas for the 
INLA-based methods, these fall in an 'S' shape around that line. 

Our convergence diagnostics for MALA consisted of examining the cell- wise lag-1 autocorre- 
lation of the MCMC chain. Example plots are shown in Figure |4j these were typical of the 
images produced in the blocks of scenarios 1-6, 7-12 and 13-18. The main feature of these 
plots is that in every case, the chain is mixing better (i.e. has lower lag-1 autocorrelation) in 
event-rich areas. Strictly, one should consider the mixing properties across all cells simulta- 
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neously, in which case the conclusion from this diagnostic would be that we would trust the 
results from scenarios 1-12, but worry about those from 13-18. However, due to a judicious 
choice of initial values for the chain (recall that T was initially set to zero), the effect of 
slow mixing in event-poor areas has less effect on the resulting inference than it might with 
a different set of initial values. Unconditionally, K{Y) = —a'^/2, so the chain in event-poor 
areas (sensibly) stays close to this value. In fact, a similar phenomenon occurs in in INLA: in 
event-poor areas, the prediction surface tends to —a'^/2. Despite the apparently slow mixing 
in scenarios 13-18, MALA still produces better predictive inference in each of these cases. 



J. --.I 



i: 

- 

- 

I 




Figure 4: Image plots of lag-1 autocorrelation of the MCMC chain; the plots correspond to 
scenarios 4 (left), 10 (middle) and 16 (right). The plots for scenarios in respective blocks 1-6, 
7-12 and 13-16 are similar in appearance. The box in the centre of the plot is the observation 
window and the data themselves appear as points on each plot. 

To investigate whether the results for INLA were due to the approximation of the latent field, 
or whether they were due to error induced in the inference, we conducted a further simulation 
study. For each of the covariance functions defined by the parameter combinations in scenarios 
1-18 above, we constructed a 2-neighbourhood GMRF approximation to the full covariance, 
and then simulated 18 further scenarios (l'-18') based on Gaussian variables simulated from 
the GMRF. In the following, INLA2 and INLA3 employ the same INLA settings as GAIN2 
and GAIN3, but in this case there is no GMRF approximation step 



Results from the second simulation study are shown in Table 6.3 and Figure [5} There are 
two main conclusions to be drawn from these results. Firstly, in Figure [5j the 'S' shape is no 
longer apparent for the INLA methods, however in the plot for INLA2, there is a noticeable 
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INLA2 


INLA3 




INLA2 


INLA3 


I' 


n 207 


8 


2'* 


1.922 


n 91 3 


oM 
o 




637 


A'* 

rt 


3 381 


n 933 





n Ado 


i.UoO 





U.044 


1 1/11 

i.i4i 


7'* 


6.506 


0.726 


8'* 


15.142 


0.622 


9'* 


1.773 


1.759 


10'* 


55.562 


1.743 


11'* 


14.542 


0.632 


12' 


0.032 


1.365 


13'* 


167.657 


3.584 


14'* 


40.266 


6.144 


15'* 


2.115 


2.278 


16'* 


56.126 


2.5 


17'* 


2.708 


0.874 


18' 


0.545 


2.586 



Mean square error in estimating probabilities, MSE2, using each of the three INLA 
approximations, relative to MSE2 for the MALA algorithm. A * in the first column 
indicates the scenarios where MALA outperformed INLA2. The left table are the results for 
fixed spatial Ai(s) and the right table gives the values for fixed spatial A2(s). 

upward bias in the centre of the plot. Also from the plots, it is clear that MALA did not 
perform well on two occasions, and INLA on one. The main second conclusion from this 
simulation study is despite the fact that INLA now shows less bias, MALA nevertheless 
still outperforms both INLA2 and INLA3. The median relative increase in MSE comparing 
MALA to INLA2 was 3 over all scenarios and 1.1 comparing MALA to INLA3. As was the 
case for scenarios 1-18, MALA performs better for fixed spatial A2(s), with a median increase 
of 1.37 for INLA3 whereas for Ai(s), INLA3 performed better at a median increase of 0.87. 



Figure 5: Illustrating the estimation of quantiles qk by those inferred by each algorithm 
Qk = E[zj!\s)] in scenarios l'-18'. MALA in left plot, GAIN2 in the middle plot and GAIN3 
in the right hand plot. 
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7 Discussion 



In this article, we have provided a tour of the mathematical and statistical techniques be- 
hind spatial prediction for log-Gaussian Cox processes. We have independently evaluated a 
previously published method for approximating spatially continuous Gaussian processes with 
GMRF and conducted a critical comparison of two methodologies for predictive Bayesian 
inference in this class of models. 

A suite of functions (as well as wrapper functions for the approximate Bayesian predictive 



inference for INLA) have been made freely available in the Igcp R package (Taylor et al 



2011). Our restriction in this paper to spatial, rather than spatio-temporal prediction is 
primarily ease of exposition. However, and unsurprisingly, the spatio-temporal concept is 
computationally more demanding. In pilot runs, even INLA was found to be quite slow 
for spatio-temporal prediction on a regular grid, due to the hugely increased dimensionality 
of the problem and a corresponding increase in the complexity of dependence patterns in 
the precision matrix (see Section [2]). Furthermore, we have restricted our choice of MCMC 
methods to the Metropolis adjusted Langevin algorithm (MALA) rather than investigating 
more sophisticated sampling techniques such as Riemann manifold Langevin or Hamiltonian 



Monte Carlo (Girolami and Calderhead, 2011). Our reasons for this choice are ease of im- 



plementation, stability and the fact that MALA has been well studied in the literature, with 
various theoretical results available concerning practical implementation. The authors are 



aware that the methods of Girolami and Calderhead (2011) have better theoretical mixing 
properties. 

We have demonstrated that MCMC can yield more accurate estimates of predictive probabil- 
ities compared with INLA-based methods for this class of models, depending on the chosen 
settings. Furthermore the predictive probabilities from MCMC can be obtained compara- 
tively quickly, and show less bias compared with those from INLA. The inferential technique 
of producing a GMRF approximation to a Gaussian field and then performing inference via 
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the integrated nested Laplace approximation should therefore be regarded with appropri- 
ate caution. This article also opens up the question of the utility of gradient-based MCMC 
methods for inference in log-Gaussian Cox process models assuming a latent Gaussian Markov 
random field; the default method of inference for these models would appear to be a blocked 



Gibbs sampling strategy eg. Rue et al. (2009). 



We have not addressed the full capabilities of both software implementations: Igcp and INLA. 
In particular, INLA provides access to inference for a wide class of latent Gaussian models, 
whilst Igcp is restricted to spatial and spatio-temporal log-Gaussian Cox processes. Further- 
more, the INLA package also provides a framework for the estimation of hyperparameters, 
which Igcp does not; this is known to be a very challenging sampling task for MCMC. 
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